
Exploiting random projections and sparsity with random forests and gradient boosting methods - Application to multi-label and multi-output learning, random forest model compression and leveraging input sparsity


Abstract

Within machine learning, the supervised learning field aims at modeling the input-output relationship of a system from past observations of its behavior. Decision trees characterize the input-output relationship through a series of nested "if-then-else" questions, the testing nodes, leading to a set of predictions, the leaf nodes. Several such trees are often combined for state-of-the-art performance: random forest ensembles average the predictions of randomized decision trees trained independently in parallel, while tree boosting ensembles train decision trees sequentially, each refining the predictions made by the previous ones.

The emergence of new applications requires supervised learning algorithms that scale, in computational power and memory space, with the number of inputs, outputs, and observations, without sacrificing accuracy. In this thesis, we identify three main areas where decision tree methods could be improved, and we provide and evaluate original algorithmic solutions for each: (i) learning over high-dimensional output spaces, (ii) learning with large sample datasets under stringent memory constraints at prediction time, and (iii) learning over high-dimensional sparse input spaces.

A first approach to learning tasks with a high-dimensional output space, called binary relevance or single target, is to train one decision tree ensemble per output. However, it completely neglects the potential correlations between the outputs. An alternative approach, multi-output decision trees, fits a single decision tree ensemble targeting all outputs simultaneously, under the assumption that all outputs are correlated. Nevertheless, both approaches (i) have exactly the same computational complexity and (ii) target extreme output correlation structures. In our first contribution, we show how to combine random projection of the output space, a dimensionality reduction method, with the random forest algorithm to decrease the learning time complexity. Accuracy is preserved, and may even be improved by reaching a different bias-variance tradeoff. In our second contribution, we first formally adapt the gradient boosting ensemble method to multi-output supervised learning tasks such as multi-output regression and multi-label classification. We then propose to combine single random projections of the output space with gradient boosting on such tasks, so as to adapt automatically to the output correlation structure.
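
The following minimal sketch illustrates the idea behind the first contribution, assuming a Gaussian projection matrix and a pseudo-inverse decoding step (both illustrative choices rather than the thesis' exact construction): the output matrix is compressed before tree growing, and predictions are mapped back to the original output space.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
n_samples, n_features, n_outputs, n_components = 300, 20, 100, 10

X = rng.randn(n_samples, n_features)
Y = rng.randn(n_samples, n_outputs)            # stand-in for real targets

# Gaussian random projection of the output space (n_outputs -> n_components).
P = rng.randn(n_outputs, n_components) / np.sqrt(n_components)
Y_proj = Y @ P                                 # (n_samples, n_components)

# One multi-output forest is grown on the compressed targets, so the
# impurity computations scale with n_components instead of n_outputs.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, Y_proj)

# Decode predictions back to the original output space via the pseudo-inverse.
Y_hat = forest.predict(X) @ np.linalg.pinv(P)
print(Y_hat.shape)                             # (300, 100)
```

The same projection idea carries over to the boosting setting. Below is a simplified sketch of multi-output gradient boosting with single random projections under squared loss; the per-stage update rule shown here is an assumption for illustration and may differ from the exact scheme developed in the thesis.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gb_single_projection(X, Y, n_stages=100, lr=0.1, seed=0):
    """Multi-output gradient boosting, one random output direction per stage."""
    rng = np.random.RandomState(seed)
    F = np.tile(Y.mean(axis=0), (len(Y), 1))    # constant initial model
    stages = []
    for _ in range(n_stages):
        R = Y - F                               # squared-loss residuals
        w = rng.randn(Y.shape[1])
        w /= np.linalg.norm(w)                  # single random projection
        tree = DecisionTreeRegressor(max_depth=3).fit(X, R @ w)
        F += lr * np.outer(tree.predict(X), w)  # update all outputs at once
        stages.append((tree, w))
    return stages, F

rng = np.random.RandomState(1)
X, Y = rng.randn(200, 10), rng.randn(200, 40)
_, F = gb_single_projection(X, Y)
print(np.mean((Y - F) ** 2))                    # training loss after boosting
```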

The random forest algorithm often generates large ensembles of complex models, thanks to the availability of a large number of observations. However, the space complexity of such models, proportional to their total number of nodes, is often prohibitive, making them ill-suited to stringent memory constraints at prediction time. In our third contribution, we propose to compress these ensembles by solving an L1-based regularization problem over the set of indicator functions defined by all their nodes.
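
A minimal sketch of this compression idea, assuming scikit-learn's `decision_path` to build the node-indicator matrix and a plain Lasso as the L1 solver (the thesis' exact formulation and solver may differ); `alpha` controls how aggressively nodes are pruned:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=500, n_features=20, random_state=0)
forest = RandomForestRegressor(n_estimators=50, max_depth=5,
                               random_state=0).fit(X, y)

# decision_path returns a sparse {0,1} matrix with one column per node of
# the ensemble and a 1 wherever a sample traverses that node.
Z, _ = forest.decision_path(X)

# The L1 penalty drives most node weights to zero; only nodes with a
# non-zero weight need to be kept in the compressed predictor.
lasso = Lasso(alpha=0.1, max_iter=50000).fit(Z, y)
kept = np.flatnonzero(lasso.coef_)
print(f"kept {kept.size} of {Z.shape[1]} node indicator functions")
```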

Some supervised learning tasks have a high-dimensional but sparse input space, where each observation has non-zero values for only a few of its input variables. Standard decision tree implementations are not well adapted to sparse input spaces, unlike other supervised learning techniques such as support vector machines or linear models. In our fourth contribution, we show how to exploit input-space sparsity algorithmically within decision tree methods. Our implementation yields a significant speed-up on both synthetic and real datasets, while producing exactly the same model. It also reduces the memory required to grow such models, by using sparse instead of dense storage for the input matrix.
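
As a usage-level illustration rather than the thesis code itself, scikit-learn's tree ensembles accept scipy.sparse input directly, which is the kind of sparsity-aware tree growing the abstract describes:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# 1% dense: each row has ~50 non-zero features out of 5000.
X = sparse_random(1000, 5000, density=0.01, format="csc", random_state=rng)
y = rng.randint(0, 2, size=1000)

# Fitting accepts CSC input directly; the grown model is exactly the one
# that would be obtained from the equivalent dense matrix X.toarray().
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
print(clf.score(X.tocsr(), y))   # prediction-side methods take CSR input
```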

Bibliographic details

  • Author

    Joly, Arnaud

  • Year 2017
  • Original format PDF
  • Language en
